In this lab, I will build a wide and deep neural network that classifies a mushroom as poisonous or edible. I am using the UCI Machine Learning Mushroom data set, which contains a total of 8,124 instances. After building three wide and deep networks, I will compare their performance and see how the best one stacks up against a multilayer perceptron.
The data set consists of several thousand mushroom records drawn from The Audubon Society Field Guide to North American Mushrooms. It includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota family. Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended; this latter class was combined with the poisonous one.
Link To Dataset - https://www.kaggle.com/uciml/mushroom-classification
It's a temperate Sunday afternoon and you are about to set out on your weekly walk through the forest to unwind and enjoy the great outdoors. You grab your trusty walking stick and a light jacket and head out to begin your journey. An hour passes, and as you walk you suddenly realize you forgot to pick up those mushrooms you needed to make your famous mushroom casserole for dinner. However, lo and behold, right in front of you is a patch of beautiful mushrooms. You grab a couple, head back home, and make your casserole. What you didn't know was that those mushrooms were poisonous, and you are lucky to make it to the ER before it's too late.
How amazing would it have been if all of that could have been avoided with a simple, easy-to-use app? Not only would you have stayed out of the hospital, you could have thoroughly enjoyed your mushroom casserole! Mushroom hunting is a real activity in which tens to hundreds of thousands of people partake. It is most popular in Europe, Australia, and Japan, but, most importantly for us, in temperate regions of the United States and Canada. The data set pertains to mushrooms on the North American continent, so we have a target audience and potential consumers. [1]
In addition to the above concern, determining whether a mushroom is poisonous or not could be immensely useful in finding out if what your pet or child just ate could prove to be fatally harmful to them. Using an app to quickly check the mushroom and determine its edibility could be used to help indicate whether immediate action is required or not.
This is where the motivation stems from. I wish to use this mushroom data to construct a neural network that specializes in predicting whether a mushroom is poisonous based on its features.
As stated above, the motivation behind using this data set was to construct a neural network capable of sufficiently predicting the edibility of a mushroom given its characteristics. That means we are looking for an extremely well-performing network that could then be loaded into an app that would aid users in finding out if that mushroom they just saw on their walk is in fact poisonous.
With the increasing adoption of natural and healthy eating, this app would make a beneficial contribution to that trend. People nowadays want to eat things that are grown naturally, free of the chemicals some corporations pump into their foods to save money, just as we ate for centuries past. The growing concern that processed and artificial foods are detrimental to our health has helped many people overcome medical issues caused by those foods. I for one can say that after quitting soda, I feel amazing and can't believe I drank it for all those years.
Overall, our primary objective here is to classify the test data set correctly given our metric for evaluation stated below.
In this lab, I plan to perform the following feature crossings for the wide portion of the network. I decided to choose the following crosses as I felt that they made sense and were logical to combine...
For architecture 1 and 2 I chose to cross the following features...
The reason I chose to cross these features is that they seem like fairly easily identifiable characteristics for other humans. The stalk colors on a given mushroom are most likely fairly close to one another, so I decided to combine those features. The same can be said about the gill and veil colors: they are easily identifiable and most likely fairly consistent within a class. Lastly, I crossed odor and habitat, as it seemed likely that a particular habitat goes hand in hand with a particular odor, at least most of the time.
For architecture 3, I chose to cross the following features...
The reason I chose to cross these features is that, like above, they seem fairly similar to one another. Population, habitat, and odor are crossed because they read more like meta-information about the mushroom than a physical description of it, and that context could be useful. Habitat and bruises are crossed because the condition of the mushroom, including whether it has bruises, is primarily attributed to its environment, hence habitat.
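To make the crossing operation concrete, here is a minimal sketch of how two categorical columns can be joined into a single crossed feature and then integer encoded. The tiny frame and its values are hypothetical stand-ins, not the real data:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical mini-frame standing in for two columns of the mushroom data
df = pd.DataFrame({'odor':    ['almond', 'none', 'foul', 'none'],
                   'habitat': ['woods', 'grasses', 'woods', 'grasses']})

# Join the category strings row-wise into one crossed category...
crossed = df[['odor', 'habitat']].apply(lambda row: '_'.join(row), axis=1)

# ...then integer encode the crossed categories for an embedding layer
enc = LabelEncoder()
crossed_int = enc.fit_transform(crossed)
print(list(crossed))      # ['almond_woods', 'none_grasses', 'foul_woods', 'none_grasses']
print(list(crossed_int))  # [0, 2, 1, 2]
```

Each unique (odor, habitat) pair becomes its own category, which is exactly what the wide branch's embedding consumes.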
For this particular objective, I am classifying whether a mushroom is poisonous based on its features so a user can determine if their own mushroom is safe for consumption. Thus, I am most concerned with false negatives, which would mean predicting a mushroom as edible when it was in fact poisonous. The reason is self-explanatory: should a user eat a poisonous mushroom, it could prove fatal. Whenever humans are involved and the outcome could be fatal, the choice of metric is extremely important in determining if a model is good or bad, and we do not want a model that occasionally predicts false negatives. Therefore, given the objective, our metric for evaluation will be recall, since recall measures how well a model avoids false negatives. We are looking for a recall of 100%, or extremely close to it, so we never predict a mushroom is edible when it is actually poisonous.
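As a quick illustration of the metric (with hypothetical labels, where 1 = poisonous as the positive class), recall is the fraction of truly poisonous mushrooms the model actually catches:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 1 = poisonous (the positive class), 0 = edible
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 0])  # misses one poisonous mushroom

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
# recall = TP / (TP + FN): the share of poisonous mushrooms actually caught
print(tp, fn)                        # 3 true positives, 1 false negative
print(recall_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
```

A single false negative here drags recall down to 0.75, which is why we hold out for scores at or near 100%.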
Below, I begin by preparing the data to be used by the model later on for training and testing. For this lab, I chose the UCI Mushroom Data Set and predict whether a given mushroom is poisonous or edible based on its characteristics, as seen below. This dataset is entirely categorical; there are no numerical features, thus all our data must be integer encoded (which we do here) prior to training or classifying. The steps for processing the data are as follows:
Target Classes:
Data Classes: (this data set is purely categorical)
import pandas as pd
import numpy as np
# plotly
import plotly
import plotly.graph_objects as go
from plotly.graph_objs import Scatter, Marker, Layout, XAxis, YAxis, Bar, Line
plotly.offline.init_notebook_mode()
# matplotlib
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
column_names = ['class',
'cap-shape',
'cap-surface',
'cap-color',
'bruises',
'odor',
'gill-attachment',
'gill-spacing',
'gill-size',
'gill-color',
'stalk-shape',
'stalk-root',
'stalk-surface-above-ring',
'stalk-surface-below-ring',
'stalk-color-above-ring',
'stalk-color-below-ring',
'veil-type',
'veil-color',
'ring-number',
'ring-type',
'spore-print-color',
'population',
'habitat']
mushrooms = pd.read_csv('SHROOM/agaricus-lepiota.data', header=None, names=column_names)
data = mushrooms
data.shape
data.head()
data.describe()
data['stalk-root'].unique()
mushrooms['stalk-root'].value_counts().plot(kind='bar')  # categorical column, so plot value counts rather than a numeric histogram
We see here that the 'stalk-root' feature contains a '?' value, indicating a missing entry. Since a substantial portion of the dataset has '?' for this feature, we will simply remove the whole column.
data.drop(['stalk-root'], inplace=True, axis=1)
data_counts = data.apply(pd.Series.value_counts).fillna(0)
data_counts
data_counts_plt = data_counts.T.plot(kind='bar',stacked=True, legend=True, figsize=(14,8), fontsize=18)
data_counts_plt.legend(loc=(1.04,0))
data
data.columns
To split the data into training and testing sets I used an 80/20 stratified shuffle split across 4 folds. I chose stratified shuffle splitting because the data set is fairly small (~8,000 instances). Neural networks generally benefit from as much data as possible (tens of thousands of instances, typically), but our data set should be fine so long as we split it carefully. Stratifying on the target class keeps the edible/poisonous ratio consistent between the train and test splits, which is something we want in order to maintain a good ratio. You can see the operation performed below...
from sklearn.model_selection import StratifiedShuffleSplit
train_set = []
test_set = []
# get train and test split
sss = StratifiedShuffleSplit(n_splits=4, test_size=0.2, random_state=42)
for train_index, test_index in sss.split(data, data['class']):
train_set.append(data.loc[train_index].reset_index(drop=True))
test_set.append(data.loc[test_index].reset_index(drop=True))
train_set[0].shape
test_set[0].shape
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import StandardScaler
feature_columns = [x+'_int' for x in data.columns]
feature_columns.remove('class_int')
for row in range(len(train_set)):
encoders = dict()
for col in data.columns:
train_set[row][col] = train_set[row][col].str.strip()
test_set[row][col] = test_set[row][col].str.strip()
if col == 'class':
tmp = LabelEncoder()
train_set[row][col] = tmp.fit_transform(train_set[row][col])
test_set[row][col] = tmp.transform(test_set[row][col])
else:
encoders[col] = LabelEncoder()
train_set[row][col+'_int'] = encoders[col].fit_transform(train_set[row][col])
test_set[row][col+'_int'] = encoders[col].transform(test_set[row][col])  # transform only, so the test encoding matches the encoder fit on the training fold
train_set[0].head()
test_set[0].head()
# this is where we get our final train and test datasets
y_train = []
X_train = []
y_test = []
X_test = []
for row in range(len(train_set)):
y_train.append(train_set[row]['class'].values.astype(int))  # np.int was removed in NumPy 1.24; use the builtin int
y_test.append(test_set[row]['class'].values.astype(int))
X_train.append(train_set[row].drop(['class'], axis=1))
X_test.append(test_set[row].drop(['class'], axis=1))
import keras
keras.__version__
# keras models and layers
from keras.models import Model, Sequential
from keras.layers import Dense, Activation, Input, Dropout, Embedding, Flatten, concatenate
# sklearn encoder, recall, confusion matrix
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.metrics import recall_score, confusion_matrix, accuracy_score, roc_curve, auc
# graph visualization
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
# define function that prints out the stats
def printStats(history_list,arch):
acc = []
loss = []
val_acc = []
val_loss = []
for hist in history_list:
acc.append(hist.history['accuracy'])
loss.append(hist.history['loss'])
val_acc.append(hist.history['val_accuracy'])
val_loss.append(hist.history['val_loss'])
acc_avg_list = np.mean(acc, axis=0)
loss_avg_list = np.mean(loss, axis=0)
val_acc_avg_list = np.mean(val_acc, axis=0)
val_loss_avg_list = np.mean(val_loss, axis=0)
vals_list = {
'acc': acc_avg_list,
'loss': loss_avg_list,
'val_acc':val_acc_avg_list,
'val_loss': val_loss_avg_list
}
acc_avg = np.average(acc)
loss_avg = np.average(loss)
val_acc_avg = np.average(val_acc)
val_loss_avg = np.average(val_loss)
vals = {
'acc': acc_avg,
'loss': loss_avg,
'val_acc':val_acc_avg,
'val_loss': val_loss_avg
}
print("ARCHITECTURE {} - ".format(arch))
print("Average accuracy: {}".format(acc_avg))
print("Average loss: {}".format(loss_avg))
print("Average validation accuracy: {}".format(val_acc_avg))
print("Average validation loss: {}".format(val_loss_avg))
return vals,vals_list
def printStats_legacy(history_list,arch):
acc = []
loss = []
val_acc = []
val_loss = []
for hist in history_list:
acc.append(hist.history['acc'])
loss.append(hist.history['loss'])
val_acc.append(hist.history['val_acc'])
val_loss.append(hist.history['val_loss'])
acc_avg = np.average(acc)
loss_avg = np.average(loss)
val_acc_avg = np.average(val_acc)
val_loss_avg = np.average(val_loss)
print("ARCHITECTURE {} - ".format(arch))
print("Average accuracy: {}".format(acc_avg))
print("Average loss: {}".format(loss_avg))
print("Average validation accuracy: {}".format(val_acc_avg))
print("Average validation loss: {}".format(val_loss_avg))
return acc_avg,loss_avg,val_acc_avg,val_loss_avg
For my first architecture I chose to use several crossed columns for the wide portion and a few deep embeddings.
For the wide portion of the network I decided to use 6 columns, paired into a total of 3 crossed columns. As seen below, I chose to cross features that would be easy for a user to distinguish and that seemed like they could go together. For example, I chose to cross the stalk color below and above the ring, as these could be very similar depending on the mushroom. The same goes for the gill color. Odor and habitat, I figured, would be interesting to cross as well, since there are smells we typically recognize in a particular biome (e.g. a swamp).
For the deep portion of the network, I decided to use dropout in between the 3 layers to prevent overfitting. I chose rates of 20% and 40% for the 1st and 2nd dropout respectively, meaning that after the first layer 20% of the activations are randomly zeroed during training, and after the second layer 40% are. I also shrank the layer size with depth: 128 neurons in the first layer, 64 in the second, and 16 in the third.
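As a rough NumPy sketch of what a dropout layer does at training time (inverted dropout, with the function and values here being purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(32)

def dropout(activations, rate):
    # Inverted dropout (training-time behavior): zero out a random `rate`
    # fraction of activations and rescale the survivors by 1 / (1 - rate),
    # so the expected activation magnitude is unchanged
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

out = dropout(np.ones(10), 0.2)
print(out)  # each unit is either 0.0 (dropped) or scaled up to ~1.25
```

At inference time dropout is a no-op; the rescaling during training is what lets the same weights be used unchanged for prediction.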
# specify crossed columns
cross_columns = [['stalk-color-below-ring','stalk-color-above-ring'],
['gill-color','veil-color'],
['odor','habitat']]
# global model variables
X_train_master = []
X_test_master = []
all_inputs = []
# wide branch variables
all_wide_branch_outputs = []
N_list = []
run_once = True
# obtain integer encoded crossed embeddings and data
for i in range(len(X_train)):
X_ints_train = []
X_ints_test = []
for cols in cross_columns:
enc = LabelEncoder()
X_crossed_train = X_train[i][cols].apply(lambda x: '_'.join(x), axis=1)
X_crossed_test = X_test[i][cols].apply(lambda x: '_'.join(x), axis=1)
enc.fit(np.hstack((X_crossed_train.values, X_crossed_test.values)))
X_crossed_train = enc.transform(X_crossed_train)
X_crossed_test = enc.transform(X_crossed_test)
X_ints_train.append( X_crossed_train )
X_ints_test.append( X_crossed_test )
if(run_once):
N_list.append(max(X_ints_train[-1]+1))
for col in feature_columns:
X_ints_train.append(X_train[i][col].values )
X_ints_test.append(X_test[i][col].values )
X_train_master.append(X_ints_train)
X_test_master.append(X_ints_test)
run_once = False
# create the wide branches
for i,n in enumerate(N_list):
inputs = Input(shape=(1,),dtype='int32', name = '_'.join(cross_columns[i]))
all_inputs.append(inputs)
x = Embedding(input_dim=n,
output_dim=int(np.sqrt(n)),
input_length=1, name = '_'.join(cross_columns[i])+'_embed')(inputs)
x = Flatten()(x)
all_wide_branch_outputs.append(x)
# merge the wide branches together
wide_branch = concatenate(all_wide_branch_outputs, name='wide_concat')
wide_branch = Dense(units=1,activation='sigmoid',name='wide_combined')(wide_branch)
# deep branch variables
all_deep_branch_outputs = []
# create the deep embeddings
for col in feature_columns:
N = max(X_train[0][col]+1)
inputs = Input(shape=(1,),dtype='int32', name=col)
all_inputs.append(inputs)
x = Embedding(input_dim=N,
output_dim=int(np.sqrt(N)),
input_length=1, name=col+'_embed')(inputs)
x = Flatten()(x)
all_deep_branch_outputs.append(x)
# merge the deep branches together
deep_branch = concatenate(all_deep_branch_outputs)
deep_branch = Dense(units=128,activation='relu')(deep_branch)
deep_branch = Dropout(0.2, seed=32)(deep_branch)
deep_branch = Dense(units=64,activation='relu')(deep_branch)
deep_branch = Dropout(0.4, seed=32)(deep_branch)
deep_branch = Dense(units=16,activation='relu')(deep_branch)
# merge wide and deep together
final_branch = concatenate([wide_branch, deep_branch],name='concat_deep_wide')
final_branch = Dense(units=1,activation='sigmoid',name='combined')(final_branch)
# gen model
model1 = Model(inputs=all_inputs, outputs=final_branch)
# you will need to install pydot properly on your machine to get this running
SVG(model_to_dot(model1).create(prog='dot', format='svg'))
%%time
# compile our model
model1.compile(optimizer='adagrad',
loss='mean_squared_error',
metrics=['accuracy'])
# save model (4 folds)
model1.save_weights('model_1_weights.h5')
recall_list = []
conf_matrix_list = []
pred_list = []
history_list = []
# loop through folds and fit the data on each fold
for i in range(len(X_train_master)):
# reset the model
model1.load_weights('model_1_weights.h5')
history_list.append(model1.fit( X_train_master[i],
y_train[i],
epochs=5,
batch_size=32,
verbose=1,
validation_data = (X_test_master[i], y_test[i])
))
pred_list.append(np.round(model1.predict(X_test_master[i])))
conf_matrix_list.append(confusion_matrix(y_test[i],pred_list[i]))
recall_list.append(recall_score(y_test[i],pred_list[i]))
print("-- CONFUSION MATRIX\n {} --".format(conf_matrix_list[i]))
print("-- RECALL {} --".format(recall_list[i]))
arch1_vals_list,arch1_mean_list = printStats(history_list, "ARCHITECTURE 1")
arch1_recall = np.average(recall_list)
print("AVG Recall: {}".format(arch1_recall))
plt.figure(figsize=(10,4))
plt.subplot(2,2,1)
plt.plot(arch1_mean_list['acc'])
plt.ylabel('Accuracy %')
plt.title('Training')
plt.subplot(2,2,2)
plt.plot(arch1_mean_list['val_acc'])
plt.title('Validation')
plt.subplot(2,2,3)
plt.plot(arch1_mean_list['loss'])
plt.ylabel('Training Loss')
plt.xlabel('epochs')
plt.subplot(2,2,4)
plt.plot(arch1_mean_list['val_loss'])
plt.xlabel('epochs')
# obtain ROC values for architecture 1
y_pred1 = model1.predict(X_test_master[0])
# false positive and true positive rates using roc
fpr_1, tpr_1, thresholds_1 = roc_curve(y_test[0], y_pred1)
#area under the curve
auc_1 = auc(fpr_1, tpr_1)
As seen above, the recall score was just shy of 100% on average; however, in 2 of the folds we actually did see a score of 100%. This is great to see, as ideally we don't want any false negatives in our results, since those could lead to a potentially fatal outcome.
One other thing that can be seen is that there may be some over-training going on here. In the next architecture I will attempt to fix that by changing up the model.
In my second architecture I chose to use a similar structure to that of my first architecture, with a few changes to try to prevent overfitting of the data.
The wide portion of this architecture should function exactly as in architecture 1. I decided to keep the same crossed embeddings, as they seemed to make sense and work well.
For the deep portion of the network I used 2 dense layers and 2 dropout layers. The first dense layer contained 128 neurons, followed by 40% dropout. The second contained 64 neurons, followed by 60% dropout.
I hope to see improvement in the average loss & validation loss scores such that they are a bit more similar in magnitude.
cross_columns = [['stalk-color-below-ring','stalk-color-above-ring'],
['gill-color','veil-color'],
['odor','habitat']]
# global model variables
X_train_master = []
X_test_master = []
all_inputs = []
# wide branch variables
all_wide_branch_outputs = []
N_list = []
run_once = True
# obtain integer encoded crossed embeddings and data
for i in range(len(X_train)):
X_ints_train = []
X_ints_test = []
for cols in cross_columns:
enc = LabelEncoder()
X_crossed_train = X_train[i][cols].apply(lambda x: '_'.join(x), axis=1)
X_crossed_test = X_test[i][cols].apply(lambda x: '_'.join(x), axis=1)
enc.fit(np.hstack((X_crossed_train.values, X_crossed_test.values)))
X_crossed_train = enc.transform(X_crossed_train)
X_crossed_test = enc.transform(X_crossed_test)
X_ints_train.append( X_crossed_train )
X_ints_test.append( X_crossed_test )
if(run_once):
N_list.append(max(X_ints_train[-1]+1))
for col in feature_columns:
X_ints_train.append(X_train[i][col].values )
X_ints_test.append(X_test[i][col].values )
X_train_master.append(X_ints_train)
X_test_master.append(X_ints_test)
run_once = False
# create the wide branches
for i,n in enumerate(N_list):
# create embedding branch from the number of categories cross_columns[i]
inputs = Input(shape=(1,),dtype='int32', name = '_'.join(cross_columns[i]))
all_inputs.append(inputs)
x = Embedding(input_dim=n,
output_dim=int(np.sqrt(n)),
input_length=1, name = '_'.join(cross_columns[i])+'_embed')(inputs)
x = Flatten()(x)
all_wide_branch_outputs.append(x)
# merge the wide branches together
wide_branch = concatenate(all_wide_branch_outputs, name='wide_concat')
wide_branch = Dense(units=1,activation='sigmoid',name='wide_combined')(wide_branch)
# deep branch variables
all_deep_branch_outputs = []
# create the deep embeddings
for col in feature_columns:
# get the number of categories
N = max(X_train[0][col]+1) # number of categories for this column
# create embedding branch from the number of categories
inputs = Input(shape=(1,),dtype='int32', name=col)
all_inputs.append(inputs)
x = Embedding(input_dim=N,
output_dim=int(np.sqrt(N)),
input_length=1, name=col+'_embed')(inputs)
x = Flatten()(x)
all_deep_branch_outputs.append(x)
# merge the deep branches together
deep_branch = concatenate(all_deep_branch_outputs)
deep_branch = Dense(units=128,activation='relu')(deep_branch)
deep_branch = Dropout(0.4, seed=32)(deep_branch)
deep_branch = Dense(units=64,activation='relu')(deep_branch)
deep_branch = Dropout(0.6, seed=32)(deep_branch)
# merge wide and deep together
final_branch = concatenate([wide_branch, deep_branch],name='concat_deep_wide')
final_branch = Dense(units=1,activation='sigmoid',name='combined')(final_branch)
# gen model
model2 = Model(inputs=all_inputs, outputs=final_branch)
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
# you will need to install pydot properly on your machine to get this running
SVG(model_to_dot(model2).create(prog='dot', format='svg'))
%%time
# compile our model
model2.compile(optimizer='adagrad',
loss='mean_squared_error',
metrics=['accuracy'])
# save model (4 folds)
model2.save_weights('model_2_weights.h5')
recall_list_2 = []
conf_matrix_list_2 = []
pred_list_2 = []
history_list_2 = []
# loop through folds and fit the data on each fold
for i in range(len(X_train_master)):
# reset the model
model2.load_weights('model_2_weights.h5')
history_list_2.append(model2.fit( X_train_master[i],
y_train[i],
epochs=5,
batch_size=32,
verbose=1,
validation_data = (X_test_master[i], y_test[i])
))
pred_list_2.append(np.round(model2.predict(X_test_master[i])))
conf_matrix_list_2.append(confusion_matrix(y_test[i],pred_list_2[i]))
recall_list_2.append(recall_score(y_test[i],pred_list_2[i]))
print("-- CONFUSION MATRIX\n {} --".format(conf_matrix_list_2[i]))
print("-- RECALL {} --".format(recall_list_2[i]))
arch2_vals_list,arch2_mean_list = printStats(history_list_2, "ARCHITECTURE 2")
arch2_recall = np.average(recall_list_2)
print("AVG Recall: {}".format(arch2_recall))
plt.figure(figsize=(10,4))
plt.subplot(2,2,1)
plt.plot(arch2_mean_list['acc'])
plt.ylabel('Accuracy %')
plt.title('Training')
plt.subplot(2,2,2)
plt.plot(arch2_mean_list['val_acc'])
plt.title('Validation')
plt.subplot(2,2,3)
plt.plot(arch2_mean_list['loss'])
plt.ylabel('Training Loss')
plt.xlabel('epochs')
plt.subplot(2,2,4)
plt.plot(arch2_mean_list['val_loss'])
plt.xlabel('epochs')
# obtain ROC values for architecture 2
y_pred2 = model2.predict(X_test_master[0])
# false positive and true positive rates using roc
fpr_2, tpr_2, thresholds_2 = roc_curve(y_test[0], y_pred2)
#area under the curve
auc_2 = auc(fpr_2, tpr_2)
In this model, the recall score was again just shy of 100% on average, but it improved on the previous architecture by 0.3%. The change is not large, but it is still better than before, and we seriously want the score as close to 100% as possible.
Unfortunately, our losses compared to the first model are pretty much exactly the same in terms of ratio. This is unfortunate, as it could signify there is still over-training occurring.
In my third architecture I plan to use a similar architecture to architecture 2, but with a changeup in the crossed features.
For the wide portion of my network I will now be crossing population-habitat-odor and bruises-habitat. The reason I chose these crossings is that there could be a strong relation among the more easily identifiable characteristics, which could improve classification. Some characteristics in the data set are quite difficult to assess, like 'gill-spacing' or anything that requires touch. If someone were strolling through the forest in search of a tasty mushroom, they could more easily generalize over an area of mushrooms by these features without getting really close (think of using an app on a whole patch of mushrooms). I also crossed bruises and habitat to see if there is a strong correlation there as well.
For the deep portion of the network, I brought the first dropout rate down from 40% to 30%, leaving the second at 60%, and I also reduced the neurons in the first and second dense layers to 100 and 50.
cross_columns = [['population','habitat','odor'],
['bruises','habitat']]
# global model variables
X_train_master = []
X_test_master = []
all_inputs = []
# wide branch variables
all_wide_branch_outputs = []
N_list = []
run_once = True
# obtain integer encoded crossed embeddings and data
for i in range(len(X_train)):
X_ints_train = []
X_ints_test = []
for cols in cross_columns:
enc = LabelEncoder()
X_crossed_train = X_train[i][cols].apply(lambda x: '_'.join(x), axis=1)
X_crossed_test = X_test[i][cols].apply(lambda x: '_'.join(x), axis=1)
enc.fit(np.hstack((X_crossed_train.values, X_crossed_test.values)))
X_crossed_train = enc.transform(X_crossed_train)
X_crossed_test = enc.transform(X_crossed_test)
X_ints_train.append( X_crossed_train )
X_ints_test.append( X_crossed_test )
if(run_once):
N_list.append(max(X_ints_train[-1]+1))
for col in feature_columns:
X_ints_train.append(X_train[i][col].values )
X_ints_test.append(X_test[i][col].values )
X_train_master.append(X_ints_train)
X_test_master.append(X_ints_test)
run_once = False
# create the wide branches
for i,n in enumerate(N_list):
# create embedding branch from the number of categories cross_columns[i]
inputs = Input(shape=(1,),dtype='int32', name = '_'.join(cross_columns[i]))
all_inputs.append(inputs)
x = Embedding(input_dim=n,
output_dim=int(np.sqrt(n)),
input_length=1, name = '_'.join(cross_columns[i])+'_embed')(inputs)
x = Flatten()(x)
all_wide_branch_outputs.append(x)
# merge the wide branches together
wide_branch = concatenate(all_wide_branch_outputs, name='wide_concat')
wide_branch = Dense(units=1,activation='sigmoid',name='wide_combined')(wide_branch)
# deep branch variables
all_deep_branch_outputs = []
# create the deep embeddings
for col in feature_columns:
# get the number of categories
N = max(X_train[0][col]+1) # number of categories for this column
# create embedding branch from the number of categories
inputs = Input(shape=(1,),dtype='int32', name=col)
all_inputs.append(inputs)
x = Embedding(input_dim=N,
output_dim=int(np.sqrt(N)),
input_length=1, name=col+'_embed')(inputs)
x = Flatten()(x)
all_deep_branch_outputs.append(x)
# merge the deep branches together
deep_branch = concatenate(all_deep_branch_outputs)
deep_branch = Dense(units=100,activation='relu')(deep_branch)
deep_branch = Dropout(0.3, seed=32)(deep_branch)
deep_branch = Dense(units=50,activation='relu')(deep_branch)
deep_branch = Dropout(0.6, seed=32)(deep_branch)
# merge wide and deep together
final_branch = concatenate([wide_branch, deep_branch],name='concat_deep_wide')
final_branch = Dense(units=1,activation='sigmoid',name='combined')(final_branch)
# gen model
model3 = Model(inputs=all_inputs, outputs=final_branch)
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
# you will need to install pydot properly on your machine to get this running
SVG(model_to_dot(model3).create(prog='dot', format='svg'))
%%time
# compile our model
model3.compile(optimizer='adagrad',
loss='mean_squared_error',
metrics=['accuracy'])
# save model (4 folds)
model3.save_weights('model_3_weights.h5')
recall_list_3 = []
conf_matrix_list_3 = []
pred_list_3 = []
history_list_3 = []
# loop through folds and fit the data on each fold
for i in range(len(X_train_master)):
# reset the model
model3.load_weights('model_3_weights.h5')
history_list_3.append(model3.fit( X_train_master[i],
y_train[i],
epochs=7,
batch_size=32,
verbose=1,
validation_data = (X_test_master[i], y_test[i])
))
pred_list_3.append(np.round(model3.predict(X_test_master[i])))
conf_matrix_list_3.append(confusion_matrix(y_test[i],pred_list_3[i]))
recall_list_3.append(recall_score(y_test[i],pred_list_3[i]))
print("-- CONFUSION MATRIX\n {} --".format(conf_matrix_list_3[i]))
print("-- RECALL {} --".format(recall_list_3[i]))
arch3_vals_list,arch3_mean_list = printStats(history_list_3, "ARCHITECTURE 3")
arch3_recall = np.average(recall_list_3)
print("AVG Recall: {}".format(arch3_recall))
plt.figure(figsize=(10,4))
plt.subplot(2,2,1)
plt.plot(arch3_mean_list['acc'])
plt.ylabel('Accuracy %')
plt.title('Training')
plt.subplot(2,2,2)
plt.plot(arch3_mean_list['val_acc'])
plt.title('Validation')
plt.subplot(2,2,3)
plt.plot(arch3_mean_list['loss'])
plt.ylabel('Training Loss')
plt.xlabel('epochs')
plt.subplot(2,2,4)
plt.plot(arch3_mean_list['val_loss'])
plt.xlabel('epochs')
# obtain ROC values for architecture 3
y_pred3 = model3.predict(X_test_master[0])
# false positive and true positive rates using roc
fpr_3, tpr_3, thresholds_3 = roc_curve(y_test[0], y_pred3)
#area under the curve
auc_3 = auc(fpr_3, tpr_3)
In this model there was an improvement in recall score, but the loss was approximately the same, so there could still be some overfitting occurring in this model. That said, we were still able to hit better percentages than the second model in earlier epochs.
This is still the best model we have seen yet.
print("AVG Model 1 Recall: {}".format(arch1_recall))
print("AVG Model 2 Recall: {}".format(arch2_recall))
print("AVG Model 3 Recall: {}".format(arch3_recall))
plt.figure(figsize=(12,12))
#plot halfway line
plt.plot([0,1], [0,1], 'k--')
#plot model 1 ROC
plt.plot(fpr_1, tpr_1, label='Model 1 (area = {:.3f})'.format(auc_1))
#plot model 2 ROC
plt.plot(fpr_2, tpr_2, label='Model 2 (area = {:.3f})'.format(auc_2))
#plot model 3 ROC
plt.plot(fpr_3, tpr_3, label='Model 3 (area = {:.3f})'.format(auc_3))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('All Wide and Deep ROC curves')
plt.legend(loc='best')
plt.show()
In the ROC plot seen above, all three models perform extremely well, achieving nearly perfect predictions for a given mushroom, which is really good to see! That said, we expected very good results with this dataset: it is very well maintained and has a lot of data to learn from, so it makes sense that the models predict mushrooms accurately. All three models also perform essentially the same; in fact, they share the exact same AUC. However, we do see a slight difference in the evaluation criterion we chose for these models. Model 3 ends up being the best, since it matches model 2 on AUC but has a higher recall score by ~0.04%. Thus, Model 3 is our best wide and deep model.
Now I will compare the best wide and deep network to sklearn's MLPClassifier. In this case, the third model was the best, as it had the best ROC curve and recall score.
X_train_ints = []
X_test_ints = []
for i in range(len(X_train)):
    X_train_ints.append(X_train[i][feature_columns])
    X_test_ints.append(X_test[i][feature_columns])
from sklearn.neural_network import MLPClassifier
yhat_list = []
mlp_recall_list = []
mlp = MLPClassifier(hidden_layer_sizes=(6,),
                    learning_rate_init=0.01,
                    random_state=1,
                    activation='relu')
for i in range(len(X_train)):
    mlp.fit(X_train_ints[i], y_train[i])
    yhat_list.append(mlp.predict(X_test_ints[i]))
    mlp_recall_list.append(recall_score(y_test[i], yhat_list[i]))
    print("MLP Recall Score: ", mlp_recall_list[i])
    print("MLP Accuracy Score: ", accuracy_score(y_test[i], yhat_list[i]))
mlp_avg_recall = np.average(mlp_recall_list)
print("Avg MLP Recall Score: ", mlp_avg_recall)
# false positive and true positive rates from the ROC curve
fpr_mlp, tpr_mlp, thresholds_mlp = roc_curve(y_test[0], yhat_list[0])
#area under the curve
auc_mlp = auc(fpr_mlp, tpr_mlp)
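One caveat: passing the hard labels from `mlp.predict` into `roc_curve` produces a curve with only a few threshold points. Using `predict_proba` scores gives a fuller curve. A minimal sketch on synthetic data (the features here are a stand-in for the encoded mushroom columns, not the real dataset):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_curve, auc

# synthetic stand-in for the encoded mushroom features (illustrative only)
X, y = make_classification(n_samples=200, random_state=1)
clf = MLPClassifier(hidden_layer_sizes=(6,), learning_rate_init=0.01,
                    random_state=1, max_iter=1000).fit(X, y)

# probability scores yield a curve over many thresholds, not just 0 and 1
scores = clf.predict_proba(X)[:, 1]
fpr_p, tpr_p, thresholds_p = roc_curve(y, scores)
auc_p = auc(fpr_p, tpr_p)
```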
plt.figure(figsize=(12,12))
#plot halfway line
plt.plot([0,1], [0,1], 'k--')
#plot Wide and Deep ROC
plt.plot(fpr_3, tpr_3, label='Wide and Deep (area = {:.3f})'.format(auc_3))
#plot MLP ROC
plt.plot(fpr_mlp, tpr_mlp, label='MLP (area = {:.3f})'.format(auc_mlp))
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Wide and Deep vs MLP ROC curve')
plt.legend(loc='best')
plt.show()
As seen in the graph above, the MLP and the wide and deep network perform essentially the same. This was expected: as stated above, the dataset we chose is truly well constructed and has many categorical features to memorize. That being said, the wide and deep model still performs better, with a higher AUC and recall score.
weights_cat_int = []
for l in model3.layers:
    if '_embed' in l.name and l.name[:-6] in feature_columns:
        print(l.name)
        weights_cat_int.append(l.get_weights())
weights_cat_int[-1][0]
import matplotlib.pyplot as plt
import numpy as np
n_categories = weights_cat_int[-1][0].shape[0]
# magnitude of the mean embedding weight for each category
mean_abs = [np.absolute(np.mean(weights_cat_int[-1][0][i:i+1])) for i in range(n_categories)]
fig = plt.figure(figsize=(10,5))
plt.title("Habitat: Magnitude of Mean Weights", fontsize=20)
plt.xlabel('Habitat type', fontsize=16)
plt.ylabel('Mean of Weights (Magnitude)', fontsize=16)
plt.bar(range(n_categories), mean_abs)
ticks = []
for i in range(n_categories):
ticks.append(encoders['habitat'].inverse_transform([i])[0])
plt.xticks(range(n_categories), ticks, ha='right')
plt.show()
As seen above, the average weight magnitudes are calculated per category of the habitat feature. The woods habitat (d) has a high magnitude, as does the path habitat (p).
This tells us that mushrooms from woods and path-like habitats influenced the edibility prediction the most. This is good, as we anticipate many of our users going on walks through forests.
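The ranking read off the bar chart can also be computed directly. A minimal sketch using a toy embedding matrix as a stand-in for the trained habitat embedding weights (the shape and values are assumptions; only the seven habitat letter codes come from the dataset):

```python
import numpy as np

# toy embedding matrix: 7 habitat codes x 4 embedding dimensions (illustrative)
rng = np.random.default_rng(0)
weights = rng.normal(size=(7, 4))
labels = ['d', 'g', 'l', 'm', 'p', 'u', 'w']

# magnitude of the mean weight per category, ranked from largest to smallest
mean_abs = np.abs(weights.mean(axis=1))
ranked = [labels[i] for i in np.argsort(mean_abs)[::-1]]
```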
n_categories = weights_cat_int[4][0].shape[0]
# magnitude of the mean embedding weight for each category
mean_abs = [np.absolute(np.mean(weights_cat_int[4][0][i:i+1])) for i in range(n_categories)]
fig = plt.figure(figsize=(10,5))
plt.title("Odor: Magnitude of Mean Weights", fontsize=20)
plt.xlabel('Odor type', fontsize=16)
plt.ylabel('Mean of Weights (Magnitude)', fontsize=16)
plt.bar(range(n_categories), mean_abs)
ticks = []
for i in range(n_categories):
ticks.append(encoders['odor'].inverse_transform([i])[0])
plt.xticks(range(n_categories), ticks, ha='right')
plt.show()
In this weight visualization, we are looking at the odor feature. As seen in the graph above, the most impactful odor types on our prediction task were anise (l) and fishy (y). This is also pretty interesting and convenient for us, as the most influential odors, fennel-like and fishy, are two scents almost everyone is familiar with!
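The tick labeling in both plots relies on `LabelEncoder.inverse_transform` to map the integer codes fed to the embeddings back to the original letters. A self-contained sketch of that pattern (the odor values here are a small subset chosen for illustration):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# toy odor column: almond (a), anise (l), none (n), fishy (y), foul (f)
odors = np.array(['a', 'l', 'n', 'y', 'f', 'l'])
enc = LabelEncoder().fit(odors)

# integer codes feed the embedding; inverse_transform recovers tick labels
codes = enc.transform(odors)
ticks = [enc.inverse_transform([i])[0] for i in range(len(enc.classes_))]
```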